ABSTRACT

This report summarizes the work conducted for the
Artificial Intelligence Measurement System (AIMS) Project which was
undertaken as an exploration of methodology to consider how the
effects of artificial intelligence systems could be compared to human
performance. The research covered four areas of inquiry: (1) natural
language processing and understanding; (2) expert systems; (3)
machine vision and visual perception; and (4) technology assessment
and evaluation. The four areas are discussed in turn with information
provided regarding the goals of individual research efforts within
each area. Comparative tests between human and computer performances
are noted. A list of 31 project reports and 12 technical reports is
included in the document. Names of project staff and consultants and
a distribution list are appended. (DB)

Artificial Intelligence Measurement System
Overview and Lessons Learned*

The Artificial Intelligence Measurement System (AIMS) project was undertaken as an
exploration of methodology to consider how the effects of artificial intelligence systems could be
compared to human performance. It was designed under a number of assumptions. The first was that
human performance is infinitely richer than the relatively primitive systems so far designed.
Although the principal measurement strategy proposed treating system performance as if it were a
point in a distribution of human performance, there was no intention of equating conceptually
computer systems and individual human performance. Prior research by Clancey (1988), for
example, documented the fact that computer systems, because of their consistency and dependence
upon a coherent view (an expert), could be compared to a set of humans working on problems in a
particular domain. Rather, the exploratory goal of this project was to investigate whether intelligent
systems could be placed on a continuum of human performance. In practice, this mapping would
test some a priori correspondences, in that relatively unsophisticated systems would be mapped on
a sample of individuals with relatively low performance and more sophisticated systems would
map to individuals with more sophisticated levels of performance. If such a set of rough
correspondences could be established, then it would be theoretically possible to benchmark
systems under development in terms of progressively higher performing populations of
individuals. Effectiveness, in terms of a performance and investment ratio, could be judged for
increasingly expensive implementations. As a simple example, we could imagine comparing the
mathematics problems solved by a system with the performance of students in kindergarten, 6th
grade, and beginning calculus. Originally, the project was formulated to focus on one area, natural
language understanding, with the corresponding human performance domain of reading
comprehension. This area held much promise because of (1) the rich research in both natural
language understanding and reading comprehension and (2) the clear differentiation of individuals
in terms of dimensions underlying text understanding. However, we were encouraged to consider
multiple areas simultaneously: natural language understanding, including interfaces and texts;
expert system shells and expert systems; and machine vision. The project also included a
technology assessment component to permit reflection on our processes in the light of progress
made elsewhere.

*Citations not included in the references are in the list of project reports immediately following the
reference page.
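
The idea of treating a system's performance as a point in a distribution of human performance can be illustrated with a percentile rank. The following is only a minimal sketch under stated assumptions: the function name `percentile_rank`, the tie-handling rule, and the sample scores are hypothetical and are not taken from any AIMS report.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(system_score, human_scores):
    """Percent of human scores below the system score, counting ties as half.

    This tie rule is one common convention, assumed here for illustration.
    """
    ordered = sorted(human_scores)
    below = bisect_left(ordered, system_score)
    ties = bisect_right(ordered, system_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical test scores for a sample of ten people (illustrative only).
human_scores = [3, 5, 5, 6, 7, 8, 8, 9, 10, 12]

# A hypothetical system that solves 8 items is located within that sample.
system_rank = percentile_rank(8, human_scores)
print(f"System falls at percentile {system_rank:.1f} of the human sample")
```

A rank computed this way locates the system relative to a sample; consistent with the assumptions above, it does not imply that the system and the individuals are conceptually equivalent.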

Another assumption of this project was that it would in part depend upon collaboration
with members of the computer science discipline. It was also assumed that this requirement
would provide a challenge because the form of evaluation we were exploring would not be within
the expectations or values of members of this discipline. Although we experienced difficulties in
acquiring systems for use and in sustaining the interest of some computer scientists, critical
components of this work were led or strongly influenced by members of the computer science
community. Moreover, the project had a desired effect in energizing members of the community to
explore approaches beyond standard software metrics to evaluate the impact of their efforts.
The project experienced all the usual difficulties in dealing with complex software (delays in
hardware implementations, concerns about the proprietary nature of code) as well as some
unanticipated problems, such as the requirement but inability to evaluate systems implemented in 
classified domains. Staff also needed to quell occasional anxiety attacks related to imagined 
litigation occasioned by the public evaluation of commercial products. 

As a strategy, the project invested the bulk of its resources in the natural language area. 
There it focused on two different types of implementations: interfaces that served to query 
databases or as front ends to expert systems and experimental text understanding systems. A 
principal effort in this project component was the development of a compatible 
descriptive/empirical strategy. The creation of a sourcebook of problems in natural language
(Read, Dyer, Baker, Mutch, Butler, Quilici, & Reeves, 1990) was undertaken as a way to describe
and map the field. This system could provide an interpretative context for the understanding of any
empirical benchmarking results. Thus, the empirical benchmarking of systems could be
understood in terms of the difficulty of the task. A partial analogy is the degree of difficulty score
paired with the performance score for a diver. Description was also a key element in the other
project components, although nowhere was the effort as extensive as in the natural
language understanding tasks. The machine vision project also created a sourcebook of problems 
(Skrzypek, Mesrobian, & Gungner, April 1988) and described existing vision systems and 
measures (Skrzypek, Mesrobian, & Gungner, March 1988). The expert system project created a 
framework for both expert systems and analogous human processes. 

The empirical, human benchmarking strategy was predicated on the idea that existing tests 
would be available for administration, and that these existing, commercially available or
research-validated achievement tests would allow the benchmarking (or comparison) of multiple
implementations. Early on in the project, it became clear that, except in the area of vision, existing
tests would be largely inappropriate because they did not reflect the domain specificity of particular
implementations. Although linking and equating strategies are available to combine information
from disparate tests, they imposed constraints in terms of the underlying dimension to be
measured and required large sample sizes. Some existing measures were used, for
example, standardized measures of reading ability, to assess performance differences, but for the
most part, an unanticipated effort needed to be made in test development to create the performance
base for comparison. This development proceeded according to strategies identified in Hively,
Patterson, and Page (1968) and in Baker and Herman (1983) using what is known as domain
referenced achievement tests. In the natural language area, an attempt was made to overcome the
domain specificity problem. We created a measure that dissociated the structure of the query from
its content base. This seemed to be the only approach available since we were assessing a system 
that needed to be reimplemented in each particular content domain each time it was applied, and the 
domain under development involved a classified Navy domain of information. In other test 
development, we were able to sidestep the domain issue by focusing on process, for example, the 
development of a test of metacognitive strategy described in the expert system component.
However, for much of our effort we were very much focused on the domain of task, the particular
texts in systems or the particular content area of an expert system.

The project explored whether human benchmarking of computer systems is possible in a 
variety of classes of systems. Our answer to that question is yes. A corollary question is whether 
benchmarking processes are routinely feasible as evaluation procedures for intelligent systems. At 
the present time, our answer is no, for the practical and technical reasons above. We recommend
the creation of descriptive resources, such as the Sourcebook, to enable the field to inform itself
and keep abreast of the progress made by the community. Such resources could break down the
unintentional barriers created by lineages of training or location. We further recommend the
pursuit of benchmarking when there are sufficient implementations in a common area to support
the investment in their common evaluation. Such evaluation would identify the differential 
emphases and effects of such systems in terms of their stated goals and in terms that program 
managers and policymakers could understand, that is, in terms of what ordinary or extraordinary 
people can and cannot do on their own. 

Natural Language Understanding 

Our research in the area of natural language understanding focused on methods of
evaluating natural language processing (NLP) systems. Our goal in this area was two-fold: 

1) we were interested in the identification and classification by example of problems in 
natural language understanding, and 

2) we were interested in the development of an evaluation methodology which considers 
system output relative to or benchmarked to human performance. 

The first approach took into account the processes that lead to output; the second approach was
concerned with output only. These two evaluation approaches can be used to describe NLP systems in
complementary ways. Baker (1987), Read, Dyer, and Feifer (1988), and Hecht and Wittrock
(1988) provide preliminary overviews of the issues addressed in the individual studies in the
natural language understanding portion of the project.

Identification of Problems in Natural Language Understanding

The first approach to the issue of NLP system evaluation, that of identification by example 
and classification of problems in natural language understanding, is realized in practical form in the 
Natural Language Sourcebook (Read, Dyer, Baker, Mutch, Butler, Quilici, & Reeves, 1990). The
Natural Language Sourcebook is a collection of 197 examples of natural language processing 
problems organized by a classification scheme which reflects an artificial intelligence perspective 
and cross-referenced by two other classification schemes, one reflecting a linguistic perspective 
and the other a cognitive-psychological perspective on the types of issues presented in the 
examples. 

The Sourcebook developmental process involved a search through the artificial intelligence, 
computational linguistics, and cognitive science literature to identify examples of processing 
problems. Each example served as the basis for a Sourcebook entry. The entries, called
"exemplars," each consist of 1) one or more sentences, a fragment of dialogue, or a piece of text
which illustrates a conceptual issue, 2) a reference, and 3) a discussion of the problem a system
might have in understanding the example. An example is used to illustrate each problem, but it is
the discussion that defines the type of problem by delineating the information-processing issues 
involved. The Sourcebook exemplars provide discussions of concrete processing problems in 
terms of the general principles at issue. This grounding of the general in the specific makes the 
Sourcebook a uniquely useful and appropriate tool for evaluation of NLP systems. 
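
The exemplar structure and the three classification schemes described above can be pictured as a small record type with one index per scheme. This is only an illustrative sketch: the field names, the category labels, and the sample entries are hypothetical and do not reproduce the Sourcebook's actual categories or contents.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    ident: int
    text: str              # sentence(s), dialogue fragment, or piece of text
    reference: str         # source of the example in the literature
    discussion: str        # the processing problem the example illustrates
    ai_class: str          # primary scheme (artificial intelligence view)
    linguistic_class: str  # cross-reference (linguistic view)
    cognitive_class: str   # cross-reference (cognitive-psychological view)


def build_index(exemplars, scheme):
    """Group exemplar ids by their category under one classification scheme."""
    index = defaultdict(list)
    for exemplar in exemplars:
        index[getattr(exemplar, scheme)].append(exemplar.ident)
    return dict(index)


# Hypothetical entries; categories and texts are invented for illustration.
exemplars = [
    Exemplar(1, "Sample sentence A.", "Source A", "Discussion A",
             "inference", "anaphora", "memory"),
    Exemplar(2, "Sample sentence B.", "Source B", "Discussion B",
             "inference", "ellipsis", "attention"),
    Exemplar(3, "Sample sentence C.", "Source C", "Discussion C",
             "planning", "anaphora", "memory"),
]

by_ai = build_index(exemplars, "ai_class")
by_linguistic = build_index(exemplars, "linguistic_class")
print(by_ai)
print(by_linguistic)
```

Keeping one index per scheme lets the same exemplar be reached from any of the three perspectives without duplicating the entry itself.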

At two different stages, the Sourcebook underwent rigorous content review. First, when
50 exemplars had been compiled, the Sourcebook was reviewed internally at UCLA by a linguist
and a cognitive scientist. Then, when 150 exemplars had been developed, the Sourcebook was sent
for external review to experts in artificial intelligence and computer science at Carnegie Mellon
University, the University of Michigan, and the Illinois Institute of Technology. Based on
reviewer comments at both stages, substantive revisions were made in the Sourcebook and
additional exemplars were developed. Once the exemplars were completed, the linguistic and
cognitive-psychological cross-indexing was added.

Finally, an electronic version of the Sourcebook database was developed (Herl, August
1990 and September 1990). This electronic HyperCard version of the Natural Language
Sourcebook capitalizes on the modular structure of the Sourcebook exemplars and facilitates use
of the multiple classification schemes by links between specific cards (exemplars). The HyperCard
version of the Natural Language Sourcebook is accompanied by a user's manual (Herl, August
1990).

The Sourcebook project is covered in Dyer and Read (1988) as well as in the introduction 
to the Sourcebook itself (Read et al., 1990). The cognitive-psychological classification scheme
used for cross-referencing the Sourcebook exemplars is presented in Wittrock (1989). A status 
report on the Sourcebook was presented at the ONR contractor's meeting held at Princeton 
University, March 1990 (Butler & Baker, 1990). 

An initial test of the usefulness of the Natural Language Sourcebook as a tool for 
describing and evaluating NLP systems is described in Mutch (1990). This report provides an
empirical verification of the problem coverage in the Natural Language Sourcebook by referencing
output from one intelligent computer system, IRUS, to the Sourcebook exemplars. From the
consideration of the IRUS queries in relation to the Natural Language Sourcebook, it appears that
the coverage of processing problems presented in the Sourcebook is sufficiently comprehensive to 
be of practical use. 

Benchmarking to Human Performance

The second approach to the issue of NLP system evaluation, that of evaluating NLP 
systems by benchmarking to human performance, was explored in two major studies. The first
provides an initial specification of a continuum of difficulty for the language that a syntactic shell
interface, IRUS, can process (Baker, Turner, & Butler, 1990). The continuum of difficulty is based on the
performance of kindergartners and first graders on comprehension tasks syntactically parallel to
those accomplished by IRUS. Baker and Lindheim (1988) and Baker, Lindheim, and Skrzypek
(1988) provide preliminary descriptions of the study presented in Baker et al. (1990).

The second study provides a comparison of the abilities of six text understanding systems
to answer specific questions about given texts with the abilities of humans to answer the same
questions about the same texts (Butler, Baker, Falk, Herl, Jang, & Mutch, 1990). In this study,
systems were benchmarked to grade equivalent groups of human subjects. 

In Baker et al. (1990), correct responses for the human subjects were determined by how 
IRUS responded to parallel items (i.e., all the IRUS responses were taken to be correct), whereas
in Butler et al. (1990), correct responses for both human subjects and intelligent computer systems
were determined by the consensus responses of adult native speakers. 

Baker et al. (1990) provides an initial verification of the feasibility of distinguishing 
intelligent computer system responses to natural language processing tasks by human 
developmental criteria; Butler et al. (1990) extends this initial investigation by looking at a larger 
range of human developmental stages and by actual benchmarking of systems' overall and
differential capabilities to human capabilities as they vary with development.
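
The two scoring ideas in this section, a key derived from consensus adult responses and the placement of a system against grade equivalent groups, can be sketched as follows. This is a hedged illustration only: the function names, the use of the modal response as the consensus, the nearest-mean rule, and all data are assumptions, not the procedures reported in Butler et al. (1990).

```python
from collections import Counter
from statistics import mean


def consensus_key(adult_responses):
    """Take the most frequent adult answer to each item as the keyed answer."""
    return [Counter(answers).most_common(1)[0][0]
            for answers in zip(*adult_responses)]


def proportion_correct(responses, key):
    """Share of items on which the responses agree with the key."""
    return sum(r == k for r, k in zip(responses, key)) / len(key)


def nearest_group(system_score, group_scores):
    """Return the group whose mean score is closest to the system's score."""
    means = {group: mean(scores) for group, scores in group_scores.items()}
    return min(means, key=lambda group: abs(means[group] - system_score))


# Hypothetical answers from three adults to four questions.
adults = [["a", "b", "a", "c"],
          ["a", "b", "b", "c"],
          ["a", "c", "a", "c"]]
key = consensus_key(adults)

# Hypothetical system answers, and proportion-correct scores by grade group.
system_score = proportion_correct(["a", "b", "b", "a"], key)
groups = {"grade 1": [0.25, 0.5, 0.25],
          "grade 3": [0.5, 0.75, 0.5],
          "grade 5": [0.75, 1.0, 1.0]}
placement = nearest_group(system_score, groups)
print(key, system_score, placement)
```

Scoring humans and systems against the same consensus key keeps the comparison symmetric, in contrast to treating the system's own responses as correct.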

Expert System Shells 

This component of the project attempted to investigate reasonable approaches to the 
evaluation of expert system shells. It attempted to explore: 

1) what methodologies available from social science might be brought to bear on the study 
of expert system shells; 

2) what was the feasibility of implementing these strategies in a routine way because of 
commercial interests in shell quality. 

This project began with the analysis of costs and benefits of experimental approaches to
the study of expert systems, particularly the construction of an experiment manipulating shells and
tasks and assigning them to system developers with various levels of expertise. Even if critical
variables, such as order, domain knowledge, and task generalizability, could be controlled, the
approach was rejected because of feasibility concerns: time, cost, and the small likelihood that
system developers appropriate to represent the population of interest could be released from their
regular tasks in order to complete our experimental requirements.

Instead, we decided to take a different tack and assess qualitatively the process of
knowledge engineering and system development using a case study approach. Following a review
of the literature (reported in Novak, Baker, & Slawson, 1991), the project recognized that typical
software metrics in use for shell evaluation did not focus in detail on either the processes or the
outcomes of development. Although our literature review did turn up studies focused on user
satisfaction and consumer-guide sorts of analyses, in-depth studies of knowledge engineering
processes had not been made. Consequently, the project posited the idea of developing a 2x2 
design for the conduct of intensive case studies, with one factor focusing on the sophistication of 
the shell in terms of representation and inferencing strategies and the other factor focusing on the 
nature of the problem, whether it was well defined or ill-structured. To undertake this work, a
well defined problem, selecting the appropriate reliability index for use with a particular form of
achievement test, was formulated. An expert psychometrician was identified, and videotapes and
observations of the knowledge engineering process were made. The first system employed was
relatively unsophisticated, M-1. The knowledge engineer had some previous domain
knowledge and had experience in implementing other expert systems in this shell. The knowledge 
engineer prepared reports (Li, 1987; Li, 1988) and early progress in this effort was reported by 
Slawson, Novak, and Hambleton (1988). The implementation was reviewed by the expert and 
found to be unsatisfactory because of domain misconceptions by the knowledge engineer. Rather 
than proceed to completion, the expert recommended that we try something else. Principally using 
the existing videotapes and with minimal visits with the expert, another implementation of an
expert system was made using NEXPERT. At that point, given the difficulty and cost of this
strategy, and with the approval of our advisors, we decided to focus on expert systems. The summary
report of effort in this area is provided in Novak, Baker, and Slawson (1990).



Benchmarking Expert Systems 

The problem of human benchmarking in an expert system context was addressed by 
research attending to the following questions: 

1) What descriptive analyses of computer expert processes and human cognitive
processes should be attempted? 

2) On what dimensions could expert system performance be benchmarked on humans? 

This work was conducted in cooperation with a subcontract to the Cognitive Science
Laboratory of USC. The project began with a literature review of benchmarking of expert
systems (O'Neil, Ni, & Jacoby, 1990), in which it became clear that the project could opt to have
computer-science driven models or psychologically driven models of benchmarking. Although it
would be ideal to cross validate these approaches, we were constrained by the lack of availability
of expert system implementations which would permit multiple tests of a psychologically driven
measurement model. The decision was to conduct human benchmarking according to the
conceptual model originally outlined in Baker (1987), that is, to norm an expert system's
performance on samples of individuals. Expert systems always involve considerable amounts of
domain-specific knowledge; thus, unlike the IRUS work described above, it was difficult to isolate
the structure of tasks from content. We believed, however, that we could, through the use of metaphor,
transform the essence of an expert system (GATES, a system that assigned airplanes to gates in
major airline hubs) into a valid psychological construct. The GATES program schedules by
assigning an item to time, location, etc., without violating constraints. The psychological equivalent
of this task is called self-monitoring in the literature. We surveyed extant measurement literature to
identify an existing, high quality instrument to assess this aspect of human metacognition. When
we found no such instrument, one was developed. Thus a study was designed that incorporated
both the benchmarking of outcomes (how well samples of students completed the GATES tasks)
and of human processes (how well students planned, selected strategies, and monitored their
behavior while conducting the task, and how aware they were of their processes). The design
methodology both in the general case and as it applied to GATES is included in the report by
O'Neil, Ni, Jacoby, and Swigger (1990). Finally, a report of the evaluation, using both process
and outcome measures, was prepared following the conduct of experimental trials (O'Neil, Baker,
Jacoby, Ni, & Wittrock, 1990). The methodology was demonstrated to be successful in that
individuals with a priori different ability levels performed predictably. A summary of the entire set 
of activities is provided by O'Neil (1990). 
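
The kind of task attributed to GATES above, assigning items to times and locations without violating constraints, can be illustrated with a small backtracking search. This sketch is not the GATES system or its algorithm: the single constraint used (two flights whose ground times overlap may not share a gate), the function names, and the flight data are all assumptions made for illustration.

```python
def overlaps(a, b):
    """True when two (arrive, depart) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def assign_gates(flights, gates):
    """Assign each flight a gate so overlapping flights never share one.

    flights maps a flight name to (arrive, depart); returns a dict mapping
    flight to gate, or None when no conflict-free assignment exists.
    """
    names = sorted(flights, key=lambda name: flights[name][0])
    assignment = {}

    def consistent(name, gate):
        return all(
            not (used == gate and overlaps(flights[name], flights[other]))
            for other, used in assignment.items()
        )

    def backtrack(i):
        if i == len(names):
            return True
        name = names[i]
        for gate in gates:
            if consistent(name, gate):
                assignment[name] = gate
                if backtrack(i + 1):
                    return True
                del assignment[name]  # undo and try the next gate
        return False

    return dict(assignment) if backtrack(0) else None


# Hypothetical schedule in minutes from the start of a shift.
flights = {"F1": (0, 60), "F2": (30, 90), "F3": (60, 120)}
plan = assign_gates(flights, ["A", "B"])
print(plan)
```

The check performed before each assignment is the computational counterpart of the monitoring step: a candidate choice is accepted only if it leaves every constraint satisfied.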

Additional outcomes for this component of the project were found. One spin-off study 
looked at the applicability of current research in software engineering, human performance
measurement, simulation, and machine learning for the evaluation of expert systems and suggested
incorporating some of the techniques into a formal assessment methodology. The methodology
was then applied to the GATES system (Swigger, O'Neil, Ni, & Jacoby, 1990). A second spin-off
study investigated the GATES task as it provided an environment for the experimental test of
explanation facilities. In an experiment, goals, tasks, and explanation types were manipulated
(Jacoby, 1990). Probably the most important outcome was the development of apparently highly
reliable and valid measures of human metacognition. These measures were developed using tested
models from the realm of personality measurement, that is, both the trait of metacognition and its
application under particular states were measured. Trait measures assess how an individual
normally functions, whereas state measures ask for his/her retrospective report of functioning under
specific conditions. These measures are currently being experimentally employed in other 
performance assessment contexts (Baker & O'Neil, 1991). They seem to have promise as 
measures of engagement and attention to complex tasks, measures with obvious application to 
military and civilian training and to educational outcome assessment in general. 
Machine Vision



The machine vision benchmarking component was completed under the direction of Dr.
Josef Skrzypek of the UCLA Computer Science Department. This component pursued
the following goals:

1. As a long term goal, the project investigated how machine vision might proceed as a
joint effort between the neurosciences and computer science.

2. Specifically related to this project, the component sought to generate a framework for
evaluating progress in machine vision by documenting the status of the field and investigating the
human visual performances that could be benchmarked on a vision system.

The strategy used for the vision benchmarking component, initially described in Baker 
(1987) and Baker, Lindheim, and Skrzypek (1988), in some ways paralleled the strategy used in
the natural language component. Three reports provide initial exploration of the machine vision
strategy (Mesrobian & Skrzypek, June 1987; Paik, Gungner, & Skrzypek, June 1987; and
Skrzypek & Mesrobian, November 1987). Following a conference of experts in computer
science, neuroscience, and psychology, the project conducted an extensive review of 15 vision
systems in order to identify possible categories along which machine vision systems could be
evaluated. In the report by Skrzypek, Mesrobian, and Gungner (March 1988), each of these
analyses is followed by justifications for the use of the human visual system as a model for a
general purpose vision system. The report identifies visual tasks from existing tests and discusses
them in terms of their corresponding computational neural substrates. Comparisons among
systems are made along five dimensions: 1) image attributes; 2) perceptual primitives; 3)
knowledge base; 4) object representation; and 5) control. Skrzypek and his colleagues rejected the
attempt to benchmark individual vision systems directly. They did so for a number of reasons.
One constraint was the idiosyncratic platforms used in the development of such systems. The cost
of acquiring sufficient hardware, appropriately configured, was well beyond the resources of
this project. Similarly, the particular domain of interest for these systems was extremely narrow.
When approaching the problem from the human side, benchmarking ran into some limitations, in
large measure because the bulk of existing systems focused on lower and middle range visual tasks
with minimal cognitive demands. Such tasks were outside accessible ranges for typical
individuals. Simple tasks, e.g., matching to samples as used in manufacturing systems, were so
automatic that people had no awareness of when and how they completed them, and one would
need to drop to visually impaired individuals or individuals with specific brain dysfunctions caused
by age, accident, or disease. On the other end, computer image enhancement pushed beyond the limits of
individual capability. Instead, the team decided to work in the opposite direction. They created a
model of general purpose vision. They assembled typical visual tasks provided to individuals in
regular psychological tests, such as paper folding and block tests, and documented neuroscience
evidence connected to them. Finally, they created a Sourcebook (Skrzypek, Mesrobian, &
Gungner, April 1988) documenting data level visual tasks. Each entry consists of a problem
statement, a discussion, references from the literature, and examples.



Technology Assessment 

A final component of this effort was the attempt to be reflective and self-conscious about
the strategies we undertook to evaluate complex systems. These strategies involve technical, 
social, financial and policy dimensions. One integrative analysis of the problem where this project 
is used as an example was created by Baker (in press) from an invited chapter presented at a 
symposium on intelligent systems sponsored by the Air Force Human Resources Laboratory. As a 
culmination to the project, a conference was held at UCLA inviting a wide range of individuals 
from the military, academic, and industrial sectors (Baker, Butler, & O'Neil, 1990). Each
presentation was focused on either general models for assessing technology, cumulative findings
in an area, or particular examples. Papers written by external consultants are included in the
report. Because we are attempting to secure a commercial contract for the publication of these and
redrafts of project reports, we prefer to restrict their circulation at this time (Baker, Butler, &
O'Neil, 1991). The conference proved to be very much work-in-progress in its focus and
underscored the relatively little systematic thought given to the assessment (and evaluation) of
technologies of all sorts. Clearly, working on the boundaries among fields (computer science,
military training, education, evaluation, and psychometrics) will provide a continuing challenge.

Summary 

The AIMS project provided documentation of explorations of the benchmarking of
intelligent systems on human performance. The project used both descriptive and empirical
strategies and a wide range of methodologies. The project was conducted in the following areas:
natural language understanding, expert systems, and machine vision. It also included a technology
assessment component.
Artificial Intelligence Measurement System (AIMS)
Project Staff (1986-1990)

The following is the list of people who served as AIMS Project Staff at different times during
the period of the contract. There was turnover from one academic year to another, particularly among
graduate students and support staff.

Project Management 

Dr. Nancy Atwood - Educational Psychology
Dr. Eva Baker - Measurement; Learning and Instruction
Dr. Frances Butler - Applied Linguistics
Dr. Dayle Hartnett - Applied Linguistics; ESL instruction
Dr. Joan Herman - Educational Evaluation; Measurement
Dr. Elaine Lindheim - Educational Evaluation; Measurement

Project Support Staff 

Kathleen Brennan - Word Processor
Rory Constancio - Office Manager
Elizabeth Freedman - Secretarial Support
Katherine Fiye - Administrative Assistant
Wanctta Jones - Conference Coordinator
Phyllis Kaelin - Financial Affairs
Aeri Lee - Administrative Support
Cindi Mercer - Administrative Assistant
Sally Metiy - Administrative Assistant
Judy Miyoshi - Administrative Assistant
Natural Language Understanding

Faculty and Staff 

Dr. Eva Baker - Measurement; Learning and Instruction
Dr. Frances Butler - Applied Linguistics
Dr. Michael Dyer - Artificial Intelligence; Natural Language Processing
Dr. Barbara Hecht - Language Development
Dr. Walter Read - Artificial Intelligence; Natural Language Processing
Dr. Merlin Wittrock - Cognitive Psychology

Graduate Students 

Tine Falk - Learning and Instruction
Cheryl Fantum - Applied Linguistics
Richard Feifer - Artificial Intelligence; Learning and Instruction
Susan Ferdman - Computer Science; Learning and Instruction
Howard Herl - Social Research Methods
Anat Jacoby - Learning and Instruction
Younghee Jang - Learning and Instruction
Karen Kellen - Learning and Instruction
Emanuel Maidenberg - Learning and Instruction
Patricia Mutch - Linguistics
Mark Neder - Applied Linguistics
Alex Quilici - Artificial Intelligence
Regie Stites - Linguistics; Anthropology
Eileen Terran - Speech Pathology; Counseling Psychology
Jean Turner - Applied Linguistics
Vision

Faculty and Staff 

Dr. Josef Skrzypek - Artificial Intelligence; Computer Vision

Graduate Students 

Edmund Mesrobian - Artificial Intelligence 
David Gungner - Artificial Intelligence
Paul Lin - Artificial Intelligence 
Emanuel Maidenberg - Learning and Instruction 
Eugene Paik - Artificial Intelligence 
Michael Stiber - Artificial Intelligence 

Expert Systems

Faculty and Staff 

Dr. Eva Baker - Measurement; Learning and Instruction
Dr. Harold F. O'Neil, Jr. - Cognitive Science Laboratory, USC (Subcontract)
Dr. Merlin Wittrock - Cognitive Psychology

Graduate Students

Simon Chang - Education
Anat Jacoby - Learning and Instruction
Yujing Ni - Learning and Instruction
John Novak - Learning and Instruction
Dean Slawson - Social Research Methods
Artificial Intelligence Measurement System
Project Consultants



Sourcebook

Jaime Carbonell, Computer Science Department, Carnegie Mellon University
Martha Evens, Computer Science Department, Illinois Institute of Technology
Evelyn Hatch, Applied Linguistics Department, UCLA
David Kieras, College of Engineering, University of Michigan
Carol Lord, Los Angeles IBM Scientific Center
Merlin Wittrock, Graduate School of Education, UCLA

Text Understanding (1990)

Carol Lord, Intelligent Text Processing, Inc., Santa Monica



Expert Systems (1987-90)

Ronald K. Hambleton, School of Education, University of Massachusetts
Zhongmin Li, School of Education, University of Southern California
Jason Millman, Cornell University
Harold F. O'Neil, Jr. (USC Subcontract)
Elliot Soloway, Department of Computer Science, Yale University
Kathleen Swigger, Computer Science Department, University of North Texas



Technology Assessment (1990) 

Nancy K. Atwood, BDM International, Inc.
John D. Bransford, Vanderbilt University
Henry Braun, Educational Testing Service
Hugh Burns, University of Texas, Austin
Richard E. Clark, USC
William Doherty, BDM International, Inc.
Wallace Feurzeig, BBN Systems and Technologies Corporation
Susan F. Goldman, Vanderbilt University
Jan Hawkins, Bank Street College for Children and Technology
James Kulik, University of Michigan
Alan Lesgold, Learning R&D Center, University of Pittsburgh
Azad M. Madni, Perceptronics
Johanna Moore, University of Pittsburgh
Elad Peled, Ben Gurion University
Zimra Peled, Ben Gurion University
James W. Pellegrino, Vanderbilt University
Kathleen Swigger, Computer Science Department, University of North Texas